Abstract

Support Vector Machines Regression (SVMR) is a regression technique which has been 
recently introduced by V. Vapnik and his collaborators (Vapnik, 1995; Vapnik, Golowich
and Smola, 1996). In SVMR the goodness of fit is measured not by the usual quadratic
loss function (the mean square error), but by a different loss function called Vapnik’s
$\epsilon$-insensitive loss function, which is similar to the “robust” loss functions introduced by
Huber (Huber, 1981). The quadratic loss function is well justified under the assumption 
of Gaussian additive noise. However, the noise model underlying the choice of Vapnik’s 
loss function is less clear. In this paper the use of Vapnik’s loss function is shown to be 
equivalent to a model of additive and Gaussian noise, where the variance and mean of the 
Gaussian are random variables. The probability distributions for the variance and mean 
will be stated explicitly. While this work is presented in the framework of SVMR, it can be 
extended to justify non-quadratic loss functions in any Maximum Likelihood or Maximum 
A Posteriori approach. It applies not only to Vapnik’s loss function, but to a broader class 
of loss functions. 

1 Introduction


Recently, there has been a growing interest in a novel function approximation technique called 
Support Vector Machines Regression (SVMR) [7, 9]. This technique was motivated by the 
framework of statistical learning theory, and also has a close relationship with the classical
regularization theory approach to function approximation. One feature of SVMR is that it measures
the interpolation error by Vapnik’s $\epsilon$-Insensitive Loss Function (ILF), a loss function similar to
the “robust” loss functions introduced by Huber [4]. While the quadratic loss function commonly
used in regularization theory is well justified under the assumption of Gaussian, additive noise,
it is less clear what precise noise model underlies the choice of Vapnik’s ILF. Understanding the
nature of this noise is important for at least two reasons: 1) it can help us decide under which
conditions it is appropriate to use SVMR rather than regularization theory; and 2) it may help
to better understand the role of the parameter $\epsilon$, which appears in the definition of Vapnik’s
ILF, and is one of the two free parameters in SVMR.

In this paper we demonstrate that the use of Vapnik’s ILF is justified under the assumption that the
noise affecting the data is additive and Gaussian, but its variance and mean are random variables 
whose probability distributions can be explicitly computed. The result is derived by using the same 
Bayesian framework which can be used to derive the regularization theory approach, and it is an 
extension of existing work on noise models and “robust” loss functions [1]. 

The plan of the paper is as follows: in section 2 we briefly review SVMR and Vapnik’s ILF; 
in section 3 we introduce the Bayesian framework necessary to prove our main result, which is 
shown in section 4; in section 5 we show some additional results which relate to the topic of 
robust statistics, while section 6 summarizes the paper. 


2 Vapnik’s $\epsilon$-insensitive loss function


Consider the following problem: we are given a data set $g = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, obtained by sampling,
with noise, some unknown function $f(\mathbf{x})$ and we are asked to recover the function $f$, or an
approximation of it, from the data $g$. A common strategy consists of choosing as a solution the
minimum of a functional of the following form:

$$H[f] = \sum_{i=1}^{N} V(y_i - f(\mathbf{x}_i)) + \alpha \Phi[f] \qquad (1)$$


where $V(x)$ is some loss function used to measure the interpolation error, $\alpha$ is a positive number,
and $\Phi[f]$ is a smoothness functional. SVMR corresponds to a particular choice for $V$, that is
Vapnik’s ILF, plotted below in figure (1):

$$V(x) = |x|_\epsilon \equiv \begin{cases} 0 & \text{if } |x| < \epsilon \\ |x| - \epsilon & \text{otherwise.} \end{cases} \qquad (2)$$


(Details about minimizing the functional (1) and the specific form of the smoothness functional
in (1) can be found in [7] and [2].) Vapnik’s ILF is similar to some of the functions used in
robust statistics (Huber, 1981), which are known to provide robustness against outliers, and it was
motivated by them (see Vapnik, 1998 [8]). However, the function (2) is not only a robust cost
function, because of its linear behavior outside the interval $[-\epsilon, \epsilon]$, but it also assigns zero cost to
errors smaller than $\epsilon$. In other words, for the cost function $V_\epsilon$ any function closer than $\epsilon$ to the
data points is a perfect interpolant.

Figure 1: Vapnik’s ILF $V_\epsilon(x)$.
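
As a minimal sketch of the definition in eq. (2), the loss can be written in a few lines of Python and compared with the quadratic loss. The function names and the value of `eps` are illustrative assumptions and do not come from the paper.

```python
def vapnik_ilf(x: float, eps: float) -> float:
    """Vapnik's epsilon-insensitive loss |x|_eps from eq. (2)."""
    # Zero inside the tube [-eps, eps], linear outside it.
    return max(abs(x) - eps, 0.0)


def quadratic_loss(x: float) -> float:
    """Quadratic loss V(x) = x^2 used in regularization theory."""
    return x * x


if __name__ == "__main__":
    eps = 0.25  # hypothetical value, chosen only for illustration
    for residual in (-1.0, -0.25, -0.1, 0.0, 0.1, 0.25, 1.0):
        print(residual, vapnik_ilf(residual, eps), quadratic_loss(residual))
```

Residuals inside the tube cost nothing under the ILF, while the quadratic loss penalizes every nonzero residual.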

It is important to notice that if we choose $V(x) = x^2$, then the functional (1) is the usual
regularization theory functional [10, 3], and its minimization leads to models which include Radial 
Basis Functions or multivariate splines. Vapnik’s ILF represents therefore a crucial difference 
between SVMR and more classical models such as splines and Radial Basis Functions. What is 
the rationale for using Vapnik’s ILF rather than a quadratic loss function as in regularization
theory? In the next section we will introduce a Bayesian framework that will allow us to answer 
this question. 

3 Bayes approach to SVMR 

In this section, a simple Bayesian framework is used to interpret variational approaches of the
type of eq. (1). Rigorous work on this topic was originally done by Kimeldorf and Wahba, and we
refer to [5, 10, 3, 6] for details. 

Suppose that the set $g = \{(\mathbf{x}_i, y_i) \in R^n \times R\}_{i=1}^{N}$ of data has been obtained by randomly sampling
a function $f$, defined on $R^n$, in the presence of additive noise, that is

$$f(\mathbf{x}_i) = y_i + \delta_i, \qquad (3)$$

where $\delta_i$ are independent random variables with a given distribution. We want to recover the
function $f$, or an estimate of it, from the set of data $g$. We take a probabilistic approach,
and regard the function $f$ as the realization of a random field with a known prior probability
distribution. We are interested in maximizing the a posteriori probability of $f$ given the data $g$,
which can be written, using Bayes’ theorem, as follows:

$$\mathcal{P}[f|g] \propto \mathcal{P}[g|f] \, \mathcal{P}[f], \qquad (4)$$

where $\mathcal{P}[g|f]$ is the conditional probability of the data $g$ given the function $f$ and $\mathcal{P}[f]$ is the
a priori probability of the random field $f$, which is often written as $\mathcal{P}[f] \propto e^{-\alpha \Phi[f]}$, where $\Phi[f]$
is usually a smoothness functional. The probability $\mathcal{P}[g|f]$ is essentially a model of the noise,
and if the noise is additive, as in eq. (3), and i.i.d. with probability distribution $P(\delta)$, it can be
written as:

$$\mathcal{P}[g|f] = \prod_{i=1}^{N} P(\delta_i) \qquad (5)$$

Substituting eq. (5) in eq. (4), it is easy to see that the function that maximizes the posterior
probability of $f$ given the data $g$ is the one that minimizes the following functional:

$$H[f] = -\sum_{i=1}^{N} \log P(f(\mathbf{x}_i) - y_i) + \alpha \Phi[f]. \qquad (6)$$

This functional is of the same form as equation (1), once we identify the loss function $V(x)$ as
the negative log-likelihood of the noise, $V(x) = -\log P(x)$. If we assume that the noise in eq. (3) is
Gaussian, with zero mean and variance $\sigma^2$, then the functional above takes the form:

$$H[f] = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 + \alpha \Phi[f],$$

which corresponds to the classical regularization theory approach [10, 3]. In order to obtain
SVMR in this approach one would have to assume that the probability distribution of the noise
is $P(\delta) = e^{-|\delta|_\epsilon}$. Unlike the assumption of Gaussian noise, it is not clear what motivates
such a choice in this Bayesian framework. The next section will address this question.
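
The identification of the loss with the negative log-likelihood of the noise can be illustrated with a short sketch. The helper names and the noise level `sigma` are hypothetical, chosen only for this example; the check confirms that a zero-mean Gaussian density yields the quadratic loss up to an additive constant.

```python
import math


def loss_from_density(density):
    """Return V(x) = -log P(x) for a noise density P."""
    return lambda x: -math.log(density(x))


def gaussian_density(sigma: float):
    """Zero-mean Gaussian density with standard deviation sigma."""
    norm = math.sqrt(2.0 * math.pi) * sigma
    return lambda d: math.exp(-d * d / (2.0 * sigma * sigma)) / norm


sigma = 0.5  # hypothetical noise level, not taken from the paper
V = loss_from_density(gaussian_density(sigma))

# Up to the additive constant V(0), the loss is x^2 / (2 sigma^2).
for x in (0.3, 1.0, 2.0):
    assert abs((V(x) - V(0.0)) - x * x / (2.0 * sigma * sigma)) < 1e-9
```

The additive constant does not affect the minimizer of the functional, which is why normalization factors of the noise density can be dropped.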

4 Main Result 

In this section we build on the probabilistic approach described in the previous section and on 
work done by Girosi [1], and derive a novel class of noise models and loss functions. 

4.1 A novel noise model 

We start by modifying eq. (5), and drop the assumption that the noise variables all have identical
probability distributions. Different data points may have been collected at different times, under
different conditions, so it is more realistic to assume that the noise variables $\delta_i$ have probability
distributions $P_i$ which are not necessarily identical. Therefore we write:

$$\mathcal{P}[g|f] = \prod_{i=1}^{N} P_i(\delta_i). \qquad (7)$$

Now we assume that the noise distributions $P_i$ are actually Gaussians, but do not necessarily have
zero mean, and define $P_i$ as:

$$P_i(\delta_i) \propto e^{-\beta_i (\delta_i - t_i)^2} \qquad (8)$$

While this model is realistic, and takes into account the fact that the noise could be biased,
it is not practical because it is unlikely that we know the set of parameters $\beta \equiv \{\beta_i\}_{i=1}^{N}$ and
$t \equiv \{t_i\}_{i=1}^{N}$. However, we may have some information about $\beta$ and $t$, for example a range for
their values, or the knowledge that most of the time they assume certain values. It is therefore
natural to model the uncertainty on $\beta$ and $t$ by considering them as i.i.d. random variables,
with probability distribution $\mathcal{P}(\beta, t) = \prod_{i=1}^{N} P(\beta_i, t_i)$. Under this assumption, eq. (8) can be
interpreted as $P(\delta_i | \beta_i, t_i)$, the conditional probability of $\delta_i$ given $\beta_i$ and $t_i$. Taking this into account,
we can rewrite eq. (4) as:

$$\mathcal{P}[f | g, \beta, t] \propto \prod_{i=1}^{N} P(\delta_i | \beta_i, t_i) \, \mathcal{P}[f]. \qquad (9)$$

Since we are interested in computing the conditional probability of $f$ given $g$, independently of
$\beta$ and $t$, we compute the marginal of the distribution above, integrating over $\beta$ and $t$:

$$\mathcal{P}^*[f|g] \propto \int d\beta \int dt \prod_{i=1}^{N} P(\delta_i | \beta_i, t_i) \, \mathcal{P}[f] \, \mathcal{P}(\beta, t) \qquad (10)$$

Using the assumption that $\beta$ and $t$ are i.i.d., so that $\mathcal{P}(\beta, t) = \prod_{i=1}^{N} P(\beta_i, t_i)$, we can easily see
that the function that maximizes the a posteriori probability $\mathcal{P}^*[f|g]$ is the one that minimizes
the following functional:

$$H[f] = \sum_{i=1}^{N} V(f(\mathbf{x}_i) - y_i) + \alpha \Phi[f] \qquad (11)$$

where $V$ is given by:

$$V(x) = -\log \int_0^{\infty} d\beta \int_{-\infty}^{+\infty} dt \, \sqrt{\beta} \, e^{-\beta (x - t)^2} P(\beta, t) \qquad (12)$$

where the factor $\sqrt{\beta}$ appears because of the normalization of the Gaussian (other constant factors
have been disregarded). Equation (11) defines a novel class of loss functions, and provides a
probabilistic interpretation for them: using a loss function $V$ with an integral representation of
the form (12) is equivalent to assuming that the noise is Gaussian, but the mean and the variance
of the noise are random variables with probability distribution $P(\beta, t)$.

The class of loss functions defined by equation (12) is an extension of the model discussed by
Girosi [1] where the case of unbiased noise distributions is considered:

$$V(x) = -\log \int_0^{\infty} d\beta \, \sqrt{\beta} \, e^{-\beta x^2} P(\beta) \qquad (13)$$

Equation (13) can be obtained from equation (12) by setting $P(\beta, t) = P(\beta)\delta(t)$. The class of
loss functions of type (13) can be completely identified. To this purpose observe that, given $V$,
the probability function $P(\beta)$ is essentially the inverse Laplace transform of $e^{-V(\sqrt{x})}$. So $V(x)$
verifies equation (13) whenever the inverse Laplace transform of $\exp(-V(\sqrt{x}))$ is nonnegative and
integrable. In practice this is difficult if not impossible to check. However, alternative conditions
that guarantee $V$ to be in the class (13) can be employed (see [1] and references therein). A simple
example of a loss function of type (13) is $V(x) = |x|^a$, $a \in (0, 2]$. When $a = 2$ this corresponds to
the classical quadratic loss function, which solves equation (13) with $P(\beta) = \delta(\beta - 1)$. The case $a = 1$
corresponds to the robust $L_1$ error measure, and equation (13) is solved by $P(\beta) = \beta^{-2} \exp(-\frac{1}{4\beta})$,
as can be seen by computing the inverse Laplace transform of $e^{-\sqrt{x}}$.
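
The $a = 1$ example can be checked numerically. The sketch below evaluates the integral in eq. (13) for $P(\beta) = \beta^{-2} \exp(-\frac{1}{4\beta})$ with a plain trapezoidal rule and compares it with $2\sqrt{\pi}\, e^{-|x|}$, the closed-form value of that integral. The function name, the integration window and the step count are assumptions made for this illustration.

```python
import math


def mixture_density(x: float, n_steps: int = 4000) -> float:
    """Numerically evaluate the integral in eq. (13) for the L1 case.

    With P(beta) = beta^-2 exp(-1/(4 beta)) and the substitution
    beta = exp(u), the integrand becomes
    exp(-u/2) * exp(-beta x^2 - 1/(4 beta)), integrated over u.
    """
    # Integration window in u: an assumption, adequate for |x| >= 0.5,
    # where the integrand is negligible at both ends.
    lo, hi = -12.0, 12.0
    h = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        u = lo + k * h
        beta = math.exp(u)
        value = math.exp(-0.5 * u - beta * x * x - 0.25 / beta)
        # Trapezoidal rule: end points carry half weight.
        total += value if 0 < k < n_steps else 0.5 * value
    return total * h


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        closed_form = 2.0 * math.sqrt(math.pi) * math.exp(-abs(x))
        print(x, mixture_density(x), closed_form)
```

Taking minus the logarithm of the result gives $|x|$ plus a constant, which is the $L_1$ loss up to the constant factors that eq. (13) disregards.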

4.2 The noise model for Vapnik’s ILF 

In order to provide a probabilistic interpretation for Vapnik’s ILF we need to find a probability
distribution $P(\beta, t)$ such that eq. (12) is verified when we set $V(x) = |x|_\epsilon$. This is a difficult
problem, which requires the solution of a linear integral equation. Here we state a solution under
the assumption that $\beta$ and $t$ are independent variables, that is:

$$P(\beta, t) = \mu(\beta) \lambda(t) \qquad (14)$$

Plugging equation (14) in equation (12) and computing the integral with respect to $\beta$ we obtain:

$$e^{-V(x)} = \int_{-\infty}^{+\infty} dt \, \lambda(t) G(x - t) \qquad (15)$$

where we have defined:

$$G(t) = \int_0^{\infty} d\beta \, \mu(\beta) \sqrt{\beta} \, e^{-\beta t^2} \qquad (16)$$

Observe that the function $G$ is a density distribution, because both functions on the r.h.s. of
equation (16) are densities.

In order to compute $G$ we observe that for $\epsilon = 0$ the function $e^{-|x|_\epsilon}$ becomes the Laplace
distribution, which we have seen above to be in the class (13). Then $\lambda_{\epsilon=0}(t) = \delta(t)$ and from
equation (15):

$$G(t) = e^{-|t|}. \qquad (17)$$

Observe also that, in view of the example discussed at the end of section (4.1) and of equation (17),
the function $\mu(\beta)$ in equation (16) is:

$$\mu(\beta) = \beta^{-2} e^{-\frac{1}{4\beta}} \qquad (18)$$

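It remains to obtain the expression of $\lambda_\epsilon(t)$ for $\epsilon > 0$. To this purpose we write equation (15) in
Fourier space:

$$F[e^{-|x|_\epsilon}] = \tilde{G}(\omega) \tilde{\lambda}_\epsilon(\omega) \qquad (19)$$

with:

$$F[e^{-|x|_\epsilon}] = \frac{\sin(\epsilon\omega) + \omega \cos(\epsilon\omega)}{\omega (1 + \omega^2)} \qquad (20)$$

and:

$$\tilde{G}(\omega) = \frac{1}{1 + \omega^2}. \qquad (21)$$

Plugging equations (20) and (21) in equation (19) we obtain:

$$\tilde{\lambda}_\epsilon(\omega) = \frac{\sin(\epsilon\omega)}{\omega} + \cos(\epsilon\omega). \qquad (22)$$

Finally, taking the inverse Fourier transform and normalizing:

$$\lambda_\epsilon(t) = \frac{1}{2(\epsilon + 1)} \left( \chi_{[-\epsilon, \epsilon]}(t) + \delta(t - \epsilon) + \delta(t + \epsilon) \right) \qquad (23)$$

where $\chi_{[-\epsilon, \epsilon]}$ is the characteristic function of the interval $[-\epsilon, \epsilon]$, and eq. (15) becomes:

$$e^{-|x|_\epsilon} \propto \int_{-\infty}^{+\infty} dt \, \lambda_\epsilon(t) e^{-|t - x|} \qquad (24)$$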

The shapes of the functions in eqs. (18) and (23) are shown in figure (2). The above model has
a simple interpretation: using Vapnik’s ILF is equivalent to assuming that the noise affecting
the data is Gaussian. However, the variance and the mean of the Gaussian noise are random
variables: the variance ($\sigma^2 = \frac{1}{2\beta}$) has a unimodal distribution that does not depend on $\epsilon$, and the
mean has a distribution which is uniform in the interval $[-\epsilon, \epsilon]$ (except for two delta functions
at $\pm\epsilon$, which ensure that the mean has nonzero probability of being equal to $\pm\epsilon$). The distribution
of the mean is consistent with the current understanding of Vapnik’s ILF: errors smaller than $\epsilon$
do not count because they may be due entirely to the bias of the Gaussian noise.

Figure 2: a) The probability distribution $P(\sigma)$, where $\sigma^2 = \frac{1}{2\beta}$ and $\mu(\beta)$ is given by eq. (18); b)
The probability distribution $\lambda_\epsilon(x)$ for $\epsilon = .25$ (see eq. (23)).
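
The interpretation above can be illustrated by sampling from the hierarchical noise model: draw the mean from eq. (23), draw the variance from eq. (18), then draw a Gaussian noise value. This is only a sketch under stated assumptions: the function names, seed and sample size are hypothetical, and the sampler uses the fact that, once eq. (18) is normalized, $1/\beta$ is exponentially distributed with mean 4. Since $e^{-|x|_\epsilon}$ integrates to $2(\epsilon + 1)$, the fraction of samples with $|\delta| < \epsilon$ should approach $\epsilon / (\epsilon + 1)$.

```python
import math
import random


def sample_noise(eps: float, rng: random.Random) -> float:
    """Draw one noise value delta from the model of section 4.2."""
    # Mean t from eq. (23): uniform on [-eps, eps] with probability
    # eps / (eps + 1), otherwise one of the two atoms at -eps and +eps.
    if rng.random() < eps / (eps + 1.0):
        t = rng.uniform(-eps, eps)
    else:
        t = eps if rng.random() < 0.5 else -eps
    # beta from eq. (18): after normalization, 1/beta is exponential
    # with rate 1/4 (mean 4); the Gaussian variance is 1 / (2 beta).
    inv_beta = rng.expovariate(0.25)
    sigma = math.sqrt(0.5 * inv_beta)
    return rng.gauss(t, sigma)


def fraction_inside(eps: float, n: int, seed: int = 0) -> float:
    """Monte Carlo estimate of Prob(|delta| < eps)."""
    rng = random.Random(seed)
    hits = sum(abs(sample_noise(eps, rng)) < eps for _ in range(n))
    return hits / n


if __name__ == "__main__":
    eps = 0.25  # same illustrative value as in figure (2)
    print(fraction_inside(eps, 100_000), eps / (eps + 1.0))
```

A histogram of the samples could be compared with $e^{-|x|_\epsilon} / (2(\epsilon + 1))$ in the same way; the single probability above is simply the cheapest check.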

Finally observe that equation (15) establishes an indirect representation of the density $e^{-V}$ as a
superposition of translates of the density $G$, where the coefficients are given by the distribution
of the mean $\lambda(t)$, the length of the translation being in the interval $[-\epsilon, \epsilon]$.
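
The superposition in equation (15) can be verified numerically for Vapnik’s ILF. The sketch below takes $G(t) = e^{-|t|}$ from eq. (17) and the distribution of the mean from eq. (23), integrates the uniform part with a trapezoidal rule, and adds the two delta-function terms as point evaluations. The function names and step count are assumptions for this example. With the normalization of eq. (23), the result should equal $e^{-|x|_\epsilon} / (\epsilon + 1)$.

```python
import math


def ilf(x: float, eps: float) -> float:
    """Vapnik's epsilon-insensitive loss |x|_eps."""
    return max(abs(x) - eps, 0.0)


def superposition(x: float, eps: float, n_steps: int = 20000) -> float:
    """Right-hand side of eq. (15) with G(t) = exp(-|t|) and eq. (23)."""
    # Uniform part of the mean distribution: trapezoidal rule on [-eps, eps].
    h = 2.0 * eps / n_steps
    integral = 0.0
    for k in range(n_steps + 1):
        t = -eps + k * h
        value = math.exp(-abs(x - t))
        integral += value if 0 < k < n_steps else 0.5 * value
    integral *= h
    # The delta functions at +eps and -eps contribute point evaluations.
    atoms = math.exp(-abs(x - eps)) + math.exp(-abs(x + eps))
    return (integral + atoms) / (2.0 * (eps + 1.0))


if __name__ == "__main__":
    eps = 0.25  # illustrative value
    for x in (-2.0, -0.25, 0.0, 0.1, 0.25, 1.0, 3.0):
        print(x, superposition(x, eps), math.exp(-ilf(x, eps)) / (eps + 1.0))
```

Inside the tube the sum is constant in $x$, which is exactly the flat region of the ILF.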


5 Additional Results 


While it is difficult to give a simple characterization of the class of loss functions with an integral
representation of the type (12), it is possible to extend the results of the previous section to a
particular sub-class of loss functions, ones of the form:

$$V_\epsilon(x) = \begin{cases} h(x) & \text{if } |x| < \epsilon \\ |x| & \text{otherwise,} \end{cases} \qquad (25)$$


where $h(x)$ is some symmetric function, with some restrictions that will become clear later. A
well known example is one of Huber’s robust loss functions [4], for which $h(x) = \frac{x^2}{2\epsilon} + \frac{\epsilon}{2}$ (see
figure (3.a)). For loss functions of the form (25), it can be shown that a function $P(\beta, t)$ that
solves eq. (12) always exists, and it has a form which is very similar to the one for Vapnik’s ILF.
More precisely, we have that $P(\beta, t) = \mu(\beta) \lambda_\epsilon(t)$, where $\mu(\beta)$ is given by eq. (18), and $\lambda_\epsilon(t)$ is
the following compact-support distribution:

$$\lambda_\epsilon(t) \propto \begin{cases} P(t) - P''(t) & \text{if } |t| < \epsilon \\ 0 & \text{otherwise,} \end{cases} \qquad (26)$$


where we have defined $P(x) = e^{-V_\epsilon(x)}$. This result does not guarantee, however, that $\lambda_\epsilon$ is a
measure, because $P(t) - P''(t)$ may not be positive on the whole interval $[-\epsilon, \epsilon]$, depending on $h$.
The positivity constraint defines the class of “admissible” functions $h$. A precise characterization
of the class of admissible $h$, and therefore of the class of “shapes” of the functions which can be
derived in this model, is currently under study. It is easy to verify that the Huber loss function
described above is admissible, and corresponds to a distribution of the mean
$\lambda_\epsilon(t) \propto (1 + \frac{1}{\epsilon} - (\frac{t}{\epsilon})^2) e^{-\frac{t^2}{2\epsilon}}$ over the interval $[-\epsilon, \epsilon]$ (see figure (3.b)).

Figure 3: a) The Huber loss function; b) the corresponding $\lambda_\epsilon(x)$, $\epsilon = .25$. Notice the difference
between this distribution and the one that corresponds to Vapnik’s ILF: while for this one the
mean of the noise is zero most of the time, in Vapnik’s ILF all the values of the mean are equally
likely.

Finally note that for the class of loss functions (25) the noise distribution can be written as the
convolution between the distribution of the mean $\lambda_\epsilon$ and the Laplace distribution:

$$P_\epsilon(x) = \int_{-\epsilon}^{+\epsilon} dt \, \lambda_\epsilon(t) e^{-|x - t|} \qquad (27)$$

Equation (27) establishes a representation of the noise distribution $P_\epsilon$ as a continuous superposition
of translated Laplace functions in the interval $[-\epsilon, \epsilon]$.
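
For the Huber loss, the convolution in equation (27) can be checked numerically. The sketch below assumes $h(x) = \frac{x^2}{2\epsilon} + \frac{\epsilon}{2}$ and uses the unnormalized distribution of the mean $P(t) - P''(t)$ from eq. (26). Because $(1 - \frac{d^2}{dt^2}) e^{-|t|} = 2\delta(t)$, the integral should equal $2 e^{-V_\epsilon(x)}$. The function names and step count are hypothetical choices for this illustration.

```python
import math


def huber_loss(x: float, eps: float) -> float:
    """Loss of the form (25) with h(x) = x^2 / (2 eps) + eps / 2."""
    ax = abs(x)
    return ax if ax >= eps else x * x / (2.0 * eps) + eps / 2.0


def p_minus_p2(t: float, eps: float) -> float:
    """P(t) - P''(t) for P = exp(-h); meaningful only for |t| <= eps."""
    p = math.exp(-(t * t / (2.0 * eps) + eps / 2.0))
    return (1.0 + 1.0 / eps - (t / eps) ** 2) * p


def laplace_superposition(x: float, eps: float, n_steps: int = 20000) -> float:
    """Trapezoidal estimate of eq. (27) with the unnormalized eq. (26)."""
    h = 2.0 * eps / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        t = -eps + k * h
        value = p_minus_p2(t, eps) * math.exp(-abs(x - t))
        # End points of the trapezoidal rule carry half weight.
        total += value if 0 < k < n_steps else 0.5 * value
    return total * h


if __name__ == "__main__":
    eps = 0.25  # same illustrative value as in figure (3)
    for x in (-1.5, -0.25, 0.0, 0.1, 0.25, 2.0):
        print(x, laplace_superposition(x, eps), 2.0 * math.exp(-huber_loss(x, eps)))
```

The factor $1 + \frac{1}{\epsilon} - (\frac{t}{\epsilon})^2$ stays above $\frac{1}{\epsilon}$ on the interval, which is the positivity that makes this $h$ admissible.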

6 Conclusions and future work 

In this work an interpretation of Vapnik’s ILF was presented in a simple Bayesian framework. 
This will hopefully lead to a better understanding of the assumptions that are implicitly made 
when using SVMR. We believe this work is important for at least two reasons: 1) it makes
clearer under which conditions it is appropriate to use SVMR rather than regularization theory; and
2) it may help to better understand the role of the parameter $\epsilon$, which appears in the definition
of Vapnik’s loss function, and is one of the two free parameters in SVMR. We demonstrated
that the use of Vapnik’s ILF is justified under the assumption that the noise affecting the data
is additive and Gaussian, but not necessarily zero mean, and that its variance and mean are
random variables with given probability distributions. Similar results can be derived for some
other loss functions of the “robust” type. A clear characterization of the class of loss functions
which can be derived in this framework is still missing, and it is the subject of current work. While
we present this work in the framework of SVMR, similar reasoning can be applied to justify 
non-quadratic loss functions in any Maximum Likelihood or Maximum A Posteriori approach. 